# Analysis and Visualization of Netflix Content

## Topics

- 1.0 Importing packages and loading data
- 2.0 Cleaning the data
  - 2.1 Checking Null Values
  - 2.2 Visualizing Null Values
  - 2.3 Treating Null Values
- 3.0 Composition and Comparison Visualisations
  - 3.1 Visualizing Composition of Content Type (Movies & TV Shows)
  - 3.2 Countries producing maximum content on Netflix
  - 3.3 Country-wise Composition of Content Type
  - 3.4 Visualizing Composition of Content Rating
  - 3.5 Qualitative Distribution of Content Type Across Maturity Ratings
  - 3.6 Quantitative Distribution of Content Type Across Maturity Ratings
  - 3.7 Count of Maturity Ratings for each Content Type
  - 3.8 Composition of Content Ratings
- 4.0 Evolution of Netflix Content & its Type over time
- 5.0 Study of Genres Correlations
- 6.0 Distribution of target audiences for each country
  - 6.1 Studying the gap between release and upload of content in different countries
  - 6.2 Comparing the Netflix content of USA & India
- 7.0 Word Cloud
- 8.0 ML Classification Model
- 9.0 Interpreting the results
## 1.0 Importing packages and loading data

Import all the packages and load the required data downloaded from Kaggle.
```python
import random
import matplotlib
import numpy as np
import pandas as pd
import seaborn as sns
from PIL import Image
import plotly.express as px
from sklearn import metrics
import matplotlib.colors
import matplotlib.pyplot as plt
import matplotlib.lines as lines
from wordcloud import WordCloud
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

df = pd.read_csv('netflix_titles.csv')
df
```

## 2.0 Cleaning the data

All the null values found will be handled below.
### 2.1 Checking Null Values

```python
# checking for null values
df.isnull().sum()
```

### 2.2 Visualizing Null Values

```python
sns.heatmap(df.isnull(), cmap='viridis')
```

### 2.3 Treating Null Values

We have null values in director, cast, country, date_added and rating, so let's deal with them.
```python
df['rating'].value_counts().unique()
```

We can remove the director and cast columns because they don't play a big role in how we visualise the data and add little value to our analysis. Since we are only interested in visualising this data, removing two columns will not be a problem. However, this should not be done as a matter of routine: if we were developing a recommender system, we could not remove the director and cast of a film, because these are important features used to recommend movies to users.
```python
df.drop(['director', 'cast'], axis=1, inplace=True)
df.head()
```

We replace all of the NaN values in the country column with United States, because Netflix was founded in the United States and all shows are available on Netflix US. So, instead of deleting the entire column, we simply replace the missing values to preserve our data.
```python
df['country'].replace(np.nan, 'United States', inplace=True)
```

We already know the release year for each film, so even if we don't know the exact release date, it won't have much of an impact on our analysis. As a result, we can remove the column containing the release date.
```python
df.head()
df['rating'].value_counts()
df['listed_in'].value_counts()
```

As we can see, our rating column has only ten missing values, which we can either drop or replace. Because TV-MA is the most common rating, we can substitute it for all of these NaN values.
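Rather than hard-coding `'TV-MA'`, the most frequent rating can be computed with `Series.mode()`. A minimal sketch, using a toy stand-in for the real rating column (the data here is invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real 'rating' column
df_demo = pd.DataFrame({'rating': ['TV-MA', 'TV-14', 'TV-MA', np.nan, 'R']})

# Compute the mode instead of hard-coding it, then fill the NaNs with it
most_common = df_demo['rating'].mode()[0]
df_demo['rating'] = df_demo['rating'].fillna(most_common)
```

This way the fill value stays correct even if the dataset is updated and a different rating becomes most common.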
```python
df['rating'].replace(np.nan, 'TV-MA', inplace=True)
df.isnull().sum()
```

Now that we've dealt with all of our missing data, let's get started on our data visualisation.
```python
df.head()
```

## 3.0 Composition and Comparison Visualisations

```python
# looking at the number of Movies and TV Shows
sns.countplot(x='type', data=df, color='#b20710')
```

### 3.1 Visualizing Composition of Content Type (Movies & TV Shows)

```python
# For viz: ratio of Movies & TV Shows
x = df.groupby(['type'])['type'].count()
y = len(df)
r = (x / y).round(2)
mf_ratio = pd.DataFrame(r).T

fig, ax = plt.subplots(1, 1, figsize=(6.5, 2.5))
ax.barh(mf_ratio.index, mf_ratio['Movie'],
        color='#b20710', alpha=0.9, label='Movie')
ax.barh(mf_ratio.index, mf_ratio['TV Show'], left=mf_ratio['Movie'],
        color='#221f1f', alpha=0.9, label='TV Show')

ax.set_xlim(0, 1)
ax.set_xticks([])
ax.set_yticks([])

# movie percentage
for i in mf_ratio.index:
    ax.annotate(f"{int(mf_ratio['Movie'][i]*100)}%",
                xy=(mf_ratio['Movie'][i]/2, i),
                va='center', ha='center', fontsize=40,
                fontweight='light', fontfamily='serif', color='white')
    ax.annotate("Movie",
                xy=(mf_ratio['Movie'][i]/2, -0.25),
                va='center', ha='center', fontsize=15,
                fontweight='light', fontfamily='serif', color='white')

# TV show percentage
for i in mf_ratio.index:
    ax.annotate(f"{int(mf_ratio['TV Show'][i]*100)}%",
                xy=(mf_ratio['Movie'][i]+mf_ratio['TV Show'][i]/2, i),
                va='center', ha='center', fontsize=40,
                fontweight='light', fontfamily='serif', color='white')
    ax.annotate("TV Show",
                xy=(mf_ratio['Movie'][i]+mf_ratio['TV Show'][i]/2, -0.25),
                va='center', ha='center', fontsize=15,
                fontweight='light', fontfamily='serif', color='white')

# Title & subtitle
fig.text(0.125, 1.03, 'Movie & TV Show distribution',
         fontfamily='serif', fontsize=15, fontweight='bold')
fig.text(0.125, 0.92, 'We see vastly more movies than TV shows on Netflix.',
         fontfamily='serif', fontsize=12)

for s in ['top', 'left', 'right', 'bottom']:
    ax.spines[s].set_visible(False)

# Removing legend due to labelled plot
ax.legend().set_visible(False)
plt.show()
```

### 3.2 Countries producing maximum content on Netflix

```python
# countries with the most content
country_count = df['country'].value_counts().sort_values(ascending=False)
country_count = pd.DataFrame(country_count)
topcountry = country_count[0:11]
topcountry
```

#### By Country

So we now know there are many more movies than TV shows on Netflix (which surprises me!).

What about if we look at content by country?

I would imagine that the USA will have the most content. I wonder how my country, the UK, will compare?
```python
# Quick feature engineering

# Helper column for various plots
df['count'] = 1

# Many productions have several countries listed - this would skew our results,
# so we'll grab just the first country mentioned
df['first_country'] = df['country'].apply(lambda x: x.split(",")[0])
df['first_country'].head()

ratings_ages = {
    'TV-PG': 'Older Kids',
    'TV-MA': 'Adults',
    'TV-Y7-FV': 'Older Kids',
    'TV-Y7': 'Older Kids',
    'TV-14': 'Teens',
    'R': 'Adults',
    'TV-Y': 'Kids',
    'NR': 'Adults',
    'PG-13': 'Teens',
    'TV-G': 'Kids',
    'PG': 'Older Kids',
    'G': 'Kids',
    'UR': 'Adults',
    'NC-17': 'Adults'
}
df['target_ages'] = df['rating'].replace(ratings_ages)
df['target_ages'].unique()

# Genre
df['genre'] = df['listed_in'].apply(
    lambda x: x.replace(' ,', ',').replace(', ', ',').split(','))

# Reducing name length
df['first_country'].replace('United States', 'USA', inplace=True)
df['first_country'].replace('United Kingdom', 'UK', inplace=True)
df['first_country'].replace('South Korea', 'S. Korea', inplace=True)

data = df.groupby('first_country')['count'].sum().sort_values(ascending=False)[:10]

# Plot
color_map = ['#f5f5f1' for _ in range(10)]
color_map[0] = color_map[1] = color_map[2] = '#b20710'  # colour highlight

fig, ax = plt.subplots(1, 1, figsize=(12, 6))
ax.bar(data.index, data, width=0.5,
       edgecolor='darkgray', linewidth=0.6, color=color_map)

# annotations
for i in data.index:
    ax.annotate(f"{data[i]}",
                xy=(i, data[i] + 150),  # roughly 5% of the highest category
                va='center', ha='center',
                fontweight='light', fontfamily='serif')

# Remove border from plot
for s in ['top', 'left', 'right']:
    ax.spines[s].set_visible(False)

# Tick labels
ax.set_xticklabels(data.index, fontfamily='serif', rotation=0)

# Title and subtitle
fig.text(0.09, 1, 'Top 10 countries on Netflix',
         fontsize=15, fontweight='bold', fontfamily='serif')
fig.text(0.09, 0.95, 'The three most frequent countries have been highlighted.',
         fontsize=12, fontweight='light', fontfamily='serif')

fig.text(1.1, 1.01, 'Insight',
         fontsize=15, fontweight='bold', fontfamily='serif')
fig.text(1.1, 0.67, '''
The most prolific producers of
content for Netflix are, primarily,
the USA, with India and the UK
a significant distance behind.

It makes sense that the USA produces
the most content as, after all,
Netflix is a US company.
''', fontsize=12, fontweight='light', fontfamily='serif')

ax.grid(axis='y', linestyle='-', alpha=0.4)
grid_y_ticks = np.arange(0, 4000, 500)  # y ticks: min, max, step
ax.set_yticks(grid_y_ticks)
ax.set_axisbelow(True)

# thicken the bottom line
plt.axhline(y=0, color='black', linewidth=1.3, alpha=.7)

ax.tick_params(axis='both', which='major', labelsize=12)
l1 = lines.Line2D([1, 1], [0, 1], transform=fig.transFigure,
                  figure=fig, color='black', lw=0.2)
fig.lines.extend([l1])
ax.tick_params(axis='both', which='both', length=0)
plt.show()
```

As predicted, the USA dominates.

The UK is a top contender too, but still some way behind India.

How does content by country vary?
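Taking only the first listed country is a pragmatic simplification. If we instead wanted to credit every country a title lists, `str.split` plus `explode` does the job. A small sketch on invented rows (the `demo` frame and `all_countries` name are mine, not part of the notebook):

```python
import pandas as pd

# Toy stand-in for the 'country' column, where titles can list several countries
demo = pd.DataFrame({'country': ['United States, India', 'India', 'United Kingdom']})

# Credit every listed country rather than just the first one
all_countries = (
    demo['country']
    .str.split(',')   # one list of countries per row
    .explode()        # one row per (title, country) pair
    .str.strip()      # drop stray spaces around names
    .value_counts()
)
```

Here India would be counted twice (once as a co-production), whereas the first-country approach counts it only once.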
### 3.3 Country-wise Composition of Content Type

```python
country_order = df['first_country'].value_counts()[:11].index
data_q2q3 = df[['type', 'first_country']].groupby('first_country')['type'] \
    .value_counts().unstack().loc[country_order]
data_q2q3['sum'] = data_q2q3.sum(axis=1)
data_q2q3_ratio = (data_q2q3.T / data_q2q3['sum']).T[['Movie', 'TV Show']] \
    .sort_values(by='Movie', ascending=False)[::-1]

fig, ax = plt.subplots(1, 1, figsize=(15, 8))
ax.barh(data_q2q3_ratio.index, data_q2q3_ratio['Movie'],
        color='#b20710', alpha=0.8, label='Movie')
ax.barh(data_q2q3_ratio.index, data_q2q3_ratio['TV Show'],
        left=data_q2q3_ratio['Movie'],
        color='#221f1f', alpha=0.8, label='TV Show')

ax.set_xlim(0, 1)
ax.set_xticks([])
ax.set_yticklabels(data_q2q3_ratio.index, fontfamily='serif', fontsize=11)

# movie percentage
for i in data_q2q3_ratio.index:
    ax.annotate(f"{data_q2q3_ratio['Movie'][i]*100:.3}%",
                xy=(data_q2q3_ratio['Movie'][i]/2, i),
                va='center', ha='center', fontsize=12,
                fontweight='light', fontfamily='serif', color='white')

# TV show percentage
for i in data_q2q3_ratio.index:
    ax.annotate(f"{data_q2q3_ratio['TV Show'][i]*100:.3}%",
                xy=(data_q2q3_ratio['Movie'][i]+data_q2q3_ratio['TV Show'][i]/2, i),
                va='center', ha='center', fontsize=12,
                fontweight='light', fontfamily='serif', color='white')

fig.text(0.13, 0.93, 'Top 10 countries Movie & TV Show split',
         fontsize=15, fontweight='bold', fontfamily='serif')
fig.text(0.131, 0.89, 'Percent Stacked Bar Chart',
         fontsize=12, fontfamily='serif')

for s in ['top', 'left', 'right', 'bottom']:
    ax.spines[s].set_visible(False)

fig.text(0.75, 0.9, "Movie", fontweight="bold", fontfamily='serif',
         fontsize=15, color='#b20710')
fig.text(0.81, 0.9, "|", fontweight="bold", fontfamily='serif',
         fontsize=15, color='black')
fig.text(0.82, 0.9, "TV Show", fontweight="bold", fontfamily='serif',
         fontsize=15, color='#221f1f')

fig.text(1.1, 0.93, 'Insight',
         fontsize=15, fontweight='bold', fontfamily='serif')
fig.text(1.1, 0.44, '''
Interestingly, Netflix in India
is made up nearly entirely of Movies.

Bollywood is big business, and perhaps
the main focus of this industry is Movies
and not TV Shows.

South Korean Netflix on the other hand is
almost entirely TV Shows.

The underlying reasons for the difference
in content must be due to market research
conducted by Netflix.
''', fontsize=12, fontweight='light', fontfamily='serif')

l1 = lines.Line2D([1, 1], [0, 1], transform=fig.transFigure,
                  figure=fig, color='black', lw=0.2)
fig.lines.extend([l1])

ax.tick_params(axis='both', which='major', labelsize=12)
ax.tick_params(axis='both', which='both', length=0)
plt.show()
```

As I've noted in the insights on the plot, it is really interesting to see how the split of TV Shows and Movies varies by country.

South Korea is dominated by TV Shows - why is this? I am a huge fan of South Korean cinema, so I know they have a great movie selection.

Equally, India is dominated by Movies. I think this might be due to Bollywood - comment below if you have any other ideas!
### 3.4 Visualizing Composition of Content Rating

```python
plt.figure(figsize=(12, 8))
sns.countplot(x='rating', data=df, color='#b20710')
```

### 3.5 Qualitative Distribution of Content Type Across Maturity Ratings

```python
# analysing the type (whether it's a Movie or a TV Show) vs. the rating it has
plt.figure(figsize=(16, 6))
sns.scatterplot(x='rating', y='type', data=df)
```

### 3.6 Quantitative Distribution of Content Type Across Maturity Ratings

Let's briefly check out how ratings are distributed.
```python
order = pd.DataFrame(df.groupby('rating')['count'].sum()
                     .sort_values(ascending=False).reset_index())
rating_order = list(order['rating'])
mf = df.groupby('type')['rating'].value_counts().unstack() \
       .sort_index().fillna(0).astype(int)[rating_order]

movie = mf.loc['Movie']
tv = -mf.loc['TV Show']

fig, ax = plt.subplots(1, 1, figsize=(12, 6))
ax.bar(movie.index, movie, width=0.5, color='#b20710', alpha=0.8, label='Movie')
ax.bar(tv.index, tv, width=0.5, color='#221f1f', alpha=0.8, label='TV Show')

# Annotations
for i in tv.index:
    ax.annotate(f"{-tv[i]}",
                xy=(i, tv[i] - 60),
                va='center', ha='center', fontweight='light',
                fontfamily='serif', color='#4a4a4a')
for i in movie.index:
    ax.annotate(f"{movie[i]}",
                xy=(i, movie[i] + 60),
                va='center', ha='center', fontweight='light',
                fontfamily='serif', color='#4a4a4a')

for s in ['top', 'left', 'right', 'bottom']:
    ax.spines[s].set_visible(False)

ax.set_xticklabels(mf.columns, fontfamily='serif')
ax.set_yticks([])
ax.legend().set_visible(False)

fig.text(0.16, 1, 'Rating distribution by Film & TV Show',
         fontsize=15, fontweight='bold', fontfamily='serif')
fig.text(0.16, 0.89, '''We observe that some ratings are only applicable to Movies.
The most common for both Movies & TV Shows are TV-MA and TV-14.''',
         fontsize=12, fontweight='light', fontfamily='serif')

fig.text(0.755, 0.924, "Movie", fontweight="bold", fontfamily='serif',
         fontsize=15, color='#b20710')
fig.text(0.815, 0.924, "|", fontweight="bold", fontfamily='serif',
         fontsize=15, color='black')
fig.text(0.825, 0.924, "TV Show", fontweight="bold", fontfamily='serif',
         fontsize=15, color='#221f1f')
plt.show()
```

### 3.7 Count of Maturity Ratings for each Content Type

```python
plt.figure(figsize=(12, 8))
sns.countplot(x='rating', data=df, hue='type', color='#b20710')
```

### 3.8 Composition of Content Ratings

```python
# distribution according to the rating
df['rating'].value_counts().plot.pie(autopct='%1.1f%%', figsize=(20, 35))
plt.show()
```

## 4.0 Evolution of Netflix Content & its Type over time

How has content been added over the years? As we saw in the timeline at the start of this analysis, Netflix went global in 2016 - and it is extremely noticeable in this plot.

The increase in Movie content is remarkable.
```python
df["date_added"] = pd.to_datetime(df['date_added'])
df['month_added'] = df['date_added'].dt.month
df['month_name_added'] = df['date_added'].dt.month_name()
df['year_added'] = df['date_added'].dt.year
df.head(3)

fig, ax = plt.subplots(1, 1, figsize=(12, 6))
color = ["#b20710", "#221f1f"]

for i, mtv in enumerate(df['type'].value_counts().index):
    mtv_rel = df[df['type'] == mtv]['year_added'].value_counts().sort_index()
    ax.plot(mtv_rel.index, mtv_rel, color=color[i], label=mtv)
    ax.fill_between(mtv_rel.index, 0, mtv_rel, color=color[i], alpha=0.9)

ax.yaxis.tick_right()
ax.axhline(y=0, color='black', linewidth=1.3, alpha=.7)

for s in ['top', 'right', 'bottom', 'left']:
    ax.spines[s].set_visible(False)

ax.grid(False)
ax.set_xlim(2008, 2020)
plt.xticks(np.arange(2008, 2021, 1))

fig.text(0.13, 0.85, 'Movies & TV Shows added over time',
         fontsize=15, fontweight='bold', fontfamily='serif')
fig.text(0.13, 0.59, '''We see a slow start for Netflix over several years.
Things begin to pick up in 2015 and then there is a
rapid increase from 2016.

It looks like content additions have slowed down in 2020,
likely due to the COVID-19 pandemic.''',
         fontsize=12, fontweight='light', fontfamily='serif')

fig.text(0.13, 0.2, "Movie", fontweight="bold", fontfamily='serif',
         fontsize=15, color='#b20710')
fig.text(0.19, 0.2, "|", fontweight="bold", fontfamily='serif',
         fontsize=15, color='black')
fig.text(0.2, 0.2, "TV Show", fontweight="bold", fontfamily='serif',
         fontsize=15, color='#221f1f')

ax.tick_params(axis='both', which='both', length=0)
plt.show()

plt.figure(figsize=(35, 6))
sns.countplot(x='release_year', data=df, color='#b20710')
```

As we can see, the majority of the movies and television shows on Netflix were released in the last decade, with only a few exceptions released earlier.
```python
plt.figure(figsize=(12, 6))
df[df["type"] == "Movie"]["release_year"].value_counts()[:20] \
    .plot(kind="bar", color="#b20710")
plt.title("Frequency of Movies which were released in different years and are available on Netflix")

plt.figure(figsize=(12, 6))
df[df["type"] == "TV Show"]["release_year"].value_counts()[:20] \
    .plot(kind="bar", color="#b20710")
plt.title("Frequency of TV shows which were released in different years and are available on Netflix")
```

## What about a more interesting way to view how content is added across the year?

Sometimes visualizations should be eye-catching and attention-grabbing - I think this visual achieves that, even if it isn't the most precise.

By highlighting certain months, the reader's eye is drawn exactly where we want it.
```python
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']
df['month_name_added'] = pd.Categorical(df['month_name_added'],
                                        categories=month_order, ordered=True)

data_sub2 = df.groupby('type')['month_name_added'].value_counts().unstack() \
    .fillna(0).loc[['TV Show', 'Movie']].cumsum(axis=0).T
data_sub2['Value'] = data_sub2['Movie'] + data_sub2['TV Show']
data_sub2 = data_sub2.reset_index()

df_polar = data_sub2.sort_values(by='month_name_added', ascending=False)

color_map = ['#221f1f' for _ in range(12)]
color_map[0] = color_map[11] = '#b20710'  # colour highlight

# initialize the figure
plt.figure(figsize=(8, 8))
ax = plt.subplot(111, polar=True)
plt.axis('off')

# Constants = parameters controlling the plot layout:
upperLimit = 30
lowerLimit = 1
labelPadding = 30

# Compute the maximum in the dataset (avoid shadowing the built-in `max`)
max_value = df_polar['Value'].max()

# Compute heights: a linear conversion of each value into the new coordinates.
# 0 in the dataset maps to lowerLimit; the maximum maps to max_value.
slope = (max_value - lowerLimit) / max_value
heights = slope * df_polar.Value + lowerLimit

# Compute the width of each bar. In total we have 2*Pi = 360°
width = 2 * np.pi / len(df_polar.index)

# Compute the angle each bar is centered on:
indexes = list(range(1, len(df_polar.index) + 1))
angles = [element * width for element in indexes]

# Draw bars
bars = ax.bar(
    x=angles, height=heights, width=width,
    bottom=lowerLimit, linewidth=2,
    edgecolor="white", color=color_map, alpha=0.8)

# Add labels
for bar, angle, height, label in zip(bars, angles, heights, df_polar["month_name_added"]):
    # Labels are rotated. Rotation must be specified in degrees :(
    rotation = np.rad2deg(angle)

    # Flip some labels upside down
    if angle >= np.pi/2 and angle < 3*np.pi/2:
        alignment = "right"
        rotation = rotation + 180
    else:
        alignment = "left"

    # Finally add the labels
    ax.text(
        x=angle,
        y=lowerLimit + bar.get_height() + labelPadding,
        s=label, ha=alignment, fontsize=10, fontfamily='serif',
        va='center', rotation=rotation, rotation_mode="anchor")
```
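The height conversion used by the polar chart is just a linear map from data values to radial heights. A small sketch isolating it (the function name `to_radial_height` is mine, for illustration):

```python
def to_radial_height(value, max_value, lower_limit):
    # Same linear map as in the chart code: 0 -> lower_limit, max_value -> max_value,
    # so even the smallest bar keeps a visible stub away from the centre
    slope = (max_value - lower_limit) / max_value
    return slope * value + lower_limit

print(to_radial_height(0, 100, 1))    # smallest possible bar: 1.0
print(to_radial_height(100, 100, 1))  # tallest bar: 100.0
```

Without the `lower_limit` offset, months with few additions would collapse into an invisible point at the centre of the circle.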
Yes, December & January are definitely the best months for new content. Maybe Netflix knows that people have a lot of time off from work over this period and that it is a good time to reel people in?

February is the worst - why might this be? Ideas welcomed!
```python
top_rated = df[0:10]
fig = px.sunburst(top_rated, path=['country'])
fig.show()
```

## 5.0 Study of Genres Correlations

#### Movie Genres

Let's now explore movie genres a little...
```python
# Custom colour map based on Netflix palette
cmap = matplotlib.colors.LinearSegmentedColormap.from_list(
    "", ['#221f1f', '#b20710', '#f5f5f1'])

def genre_heatmap(df, title):
    df['genre'] = df['listed_in'].apply(
        lambda x: x.replace(' ,', ',').replace(', ', ',').split(','))
    Types = []
    for i in df['genre']:
        Types += i
    Types = set(Types)
    print("There are {} types in the Netflix {} Dataset".format(len(Types), title))
    test = df['genre']
    mlb = MultiLabelBinarizer()
    res = pd.DataFrame(mlb.fit_transform(test), columns=mlb.classes_, index=test.index)
    corr = res.corr()
    mask = np.zeros_like(corr, dtype=bool)  # np.bool is deprecated; use the builtin
    mask[np.triu_indices_from(mask)] = True
    fig, ax = plt.subplots(figsize=(10, 7))
    fig.text(.54, .88, 'Genre correlation',
             fontfamily='serif', fontweight='bold', fontsize=15)
    fig.text(.75, .665, '''
     It is interesting that Independent Movies
     tend to be Dramas.

     Another observation is that
     International Movies are rarely
     in the Children's genre.
     ''', fontfamily='serif', fontsize=12, ha='right')
    sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, vmin=-.3,
                center=0, square=True, linewidths=2.5)
    plt.show()

df_tv = df[df["type"] == "TV Show"]
df_movies = df[df["type"] == "Movie"]

genre_heatmap(df_movies, 'Movie')
plt.show()

plt.figure(figsize=(12, 6))
df[df["type"] == "Movie"]["listed_in"].value_counts()[:10] \
    .plot(kind="barh", color="#b20710")
plt.title("Top 10 Genres of Movies", size=18)

plt.figure(figsize=(12, 6))
df[df["type"] == "TV Show"]["listed_in"].value_counts()[:10] \
    .plot(kind="barh", color="brown")
plt.title("Top 10 Genres of TV Shows", size=18)
```

## 6.0 Distribution of target audiences for each country

```python
df_countries = pd.DataFrame(df.country.value_counts().reset_index().values,
                            columns=["country", "count"])
df_countries.head()

# distribution of content on the basis of countries
fig = px.choropleth(
    locationmode='country names',
    locations=df_countries.country,
    labels=df_countries["count"],
)
fig.show()
```

#### Target Ages

Does Netflix uniformly target certain demographics? Or does this vary by country?
```python
data = df.groupby('first_country')['count'].sum() \
    .sort_values(ascending=False)[:10].index

df_heatmap = df.loc[df['first_country'].isin(data)]
df_heatmap = pd.crosstab(df_heatmap['first_country'],
                         df_heatmap['target_ages'],
                         normalize="index").T

fig, ax = plt.subplots(1, 1, figsize=(12, 12))

country_order2 = ['USA', 'India', 'UK', 'Canada', 'Japan',
                  'France', 'S. Korea', 'Spain', 'Mexico']
age_order = ['Kids', 'Older Kids', 'Teens', 'Adults']

sns.heatmap(df_heatmap.loc[age_order, country_order2], cmap=cmap, square=True,
            linewidth=2.5, cbar=False, annot=True, fmt='1.0%',
            vmax=.6, vmin=0.05, ax=ax, annot_kws={"fontsize": 12})

ax.spines['top'].set_visible(True)

fig.text(.99, .725, 'Target ages proportion of total content by country',
         fontweight='bold', fontfamily='serif', fontsize=15, ha='right')
fig.text(0.99, 0.7, 'Here we see interesting differences between countries. Most shows in India are targeted to teens, for instance.',
         ha='right', fontsize=12, fontfamily='serif')

ax.set_yticklabels(ax.get_yticklabels(), fontfamily='serif', rotation=0, fontsize=11)
ax.set_xticklabels(ax.get_xticklabels(), fontfamily='serif', rotation=90, fontsize=11)
ax.set_ylabel('')
ax.set_xlabel('')
ax.tick_params(axis='both', which='both', length=0)
plt.tight_layout()
plt.show()
```

Very interesting results.

It is also interesting to note similarities between culturally similar countries - the US & UK are closely aligned in their Netflix target ages, yet vastly different to, say, India or Japan!
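The core of the heatmap above is `pd.crosstab(..., normalize='index')`, which turns raw counts into per-country proportions. A tiny sketch with invented rows (toy data, not the real dataset) showing that each country's row sums to 1:

```python
import pandas as pd

demo = pd.DataFrame({
    'first_country': ['USA', 'USA', 'USA', 'India', 'India'],
    'target_ages':   ['Adults', 'Adults', 'Teens', 'Teens', 'Teens'],
})

# normalize='index' divides each row of counts by that row's total
ct = pd.crosstab(demo['first_country'], demo['target_ages'], normalize='index')
```

With these toy rows, India comes out as 100% Teens while the USA is 2/3 Adults and 1/3 Teens, which is exactly the kind of proportion the heatmap annotates.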
### 6.1 Studying the gap between release and upload of content in different countries

#### Let's have a quick look at the lag between when content is released and when it is added on Netflix

Spain looks to have a lot of new content. Great for them!
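Before plotting, note that the lag we are after is simply `year_added - release_year`, averaged per country. A sketch with invented rows (the column names match the notebook's, but the data here is made up):

```python
import pandas as pd

movies = pd.DataFrame({
    'first_country': ['Spain', 'Spain', 'India', 'India'],
    'release_year':  [2019, 2020, 2005, 2011],
    'year_added':    [2020, 2020, 2018, 2019],
})

# Years each movie sat between release and appearing on Netflix, averaged per country
movies['lag'] = movies['year_added'] - movies['release_year']
avg_lag = movies.groupby('first_country')['lag'].mean()
```

The dumbbell plots below visualise the same quantity as the horizontal distance between the average release year and the average added year.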
```python
# Relevant groupings
data = df_movies.groupby('first_country')['count'].sum() \
    .sort_values(ascending=False)[:10].index
df_loli = df_movies.loc[df_movies['first_country'].isin(data)]
loli = df_loli.groupby('first_country')[['release_year', 'year_added']].mean().round()

# Reorder following the average release year
ordered_df = loli.sort_values(by='release_year')
my_range = range(1, len(loli.index) + 1)

fig, ax = plt.subplots(1, 1, figsize=(7, 5))
fig.text(0.13, 0.9, 'How old are the movies? [Average]',
         fontsize=15, fontweight='bold', fontfamily='serif')

plt.hlines(y=my_range, xmin=ordered_df['release_year'],
           xmax=ordered_df['year_added'], color='grey', alpha=0.4)
plt.scatter(ordered_df['release_year'], my_range, color='#221f1f',
            s=100, alpha=0.9, label='Average release date')
plt.scatter(ordered_df['year_added'], my_range, color='#b20710',
            s=100, alpha=0.9, label='Average added date')

for s in ['top', 'left', 'right', 'bottom']:
    ax.spines[s].set_visible(False)

# Removes the tick marks but keeps the labels
ax.tick_params(axis='both', which='both', length=0)

# Move Y axis to the right side
ax.yaxis.tick_right()
plt.yticks(my_range, ordered_df.index)
plt.yticks(fontname="serif", fontsize=12)

# Custom legend
fig.text(0.19, 0.175, "Released", fontweight="bold", fontfamily='serif',
         fontsize=12, color='#221f1f')
fig.text(0.76, 0.175, "Added", fontweight="bold", fontfamily='serif',
         fontsize=12, color='#b20710')

fig.text(0.13, 0.46, '''The average gap between when
content is released, and when it
is then added on Netflix varies
by country.

In Spain, Netflix appears to be
dominated by newer movies, whereas
Egypt & India have
an older average movie.''',
         fontsize=12, fontweight='light', fontfamily='serif')
plt.show()
```

What about TV shows...
```python
data = df_tv.groupby('first_country')['count'].sum() \
    .sort_values(ascending=False)[:10].index
df_loli = df_tv.loc[df_tv['first_country'].isin(data)]
loli = df_loli.groupby('first_country')[['release_year', 'year_added']].mean().round()

# Reorder following the average release year
ordered_df = loli.sort_values(by='release_year')
my_range = range(1, len(loli.index) + 1)

fig, ax = plt.subplots(1, 1, figsize=(7, 5))
fig.text(0.13, 0.9, 'How old are the TV shows? [Average]',
         fontsize=15, fontweight='bold', fontfamily='serif')

plt.hlines(y=my_range, xmin=ordered_df['release_year'],
           xmax=ordered_df['year_added'], color='grey', alpha=0.4)
plt.scatter(ordered_df['release_year'], my_range, color='#221f1f',
            s=100, alpha=0.9, label='Average release date')
plt.scatter(ordered_df['year_added'], my_range, color='#b20710',
            s=100, alpha=0.9, label='Average added date')

for s in ['top', 'left', 'right', 'bottom']:
    ax.spines[s].set_visible(False)

ax.yaxis.tick_right()
plt.yticks(my_range, ordered_df.index)
plt.yticks(fontname="serif", fontsize=12)

fig.text(0.19, 0.175, "Released", fontweight="bold", fontfamily='serif',
         fontsize=12, color='#221f1f')
fig.text(0.47, 0.175, "Added", fontweight="bold", fontfamily='serif',
         fontsize=12, color='#b20710')

fig.text(0.13, 0.42, '''The gap for TV shows seems
more regular than for movies.

This is likely due to subsequent
series being released
year-on-year.

Spain seems to have
the newest content
overall.''',
         fontsize=12, fontweight='light', fontfamily='serif')

ax.tick_params(axis='both', which='both', length=0)
plt.show()
```

### 6.2 Comparing the Netflix content of USA & India

#### USA & India

As the two largest content countries, it might be fun to compare the two.
```python
us_ind = df[(df['first_country'] == 'USA') | (df['first_country'] == 'India')]
data_sub = df.groupby('first_country')['year_added'].value_counts().unstack().fillna(0).loc[['USA', 'India']].cumsum(axis=0).T

fig, ax = plt.subplots(1, 1, figsize=(12, 6))
color = ['#221f1f', '#b20710', '#f5f5f1']

for i, hs in enumerate(us_ind['first_country'].value_counts().index):
    hs_built = us_ind[us_ind['first_country'] == hs]['year_added'].value_counts().sort_index()
    ax.plot(hs_built.index, hs_built, color=color[i], label=hs)
    ax.fill_between(hs_built.index, 0, hs_built, color=color[i], label=hs)

ax.set_ylim(0, 1000)
ax.yaxis.tick_right()
ax.axhline(y=0, color='black', linewidth=1.3, alpha=.4)

for s in ['top', 'right', 'bottom', 'left']:
    ax.spines[s].set_visible(False)

ax.grid(False)
ax.set_xticklabels(data_sub.index, fontfamily='serif', rotation=0)
ax.margins(x=0)  # remove white space next to the margins
ax.set_xlim(2008, 2020)
plt.xticks(np.arange(2008, 2021, 1))

fig.text(0.13, 0.85, 'USA vs. India: When was content added?', fontsize=15, fontweight='bold', fontfamily='serif')
fig.text(0.13, 0.58,
'''
We know from our work above that Netflix is dominated by the USA & India.
It would also be reasonable to assume that, since Netflix is an American
company, Netflix increased content first in the USA, before other nations.

That is exactly what we see here; a slow and then rapid
increase in content for the USA, followed by Netflix being
launched to the Indian market in 2016.
'''
, fontsize=12, fontweight='light', fontfamily='serif')

fig.text(0.13, 0.15, "India", fontweight="bold", fontfamily='serif', fontsize=15, color='#b20710')
fig.text(0.188, 0.15, "|", fontweight="bold", fontfamily='serif', fontsize=15, color='black')
fig.text(0.198, 0.15, "USA", fontweight="bold", fontfamily='serif', fontsize=15, color='#221f1f')

ax.tick_params(axis='both', which='both', length=0)

plt.show()
```

So the USA dominates. Let's look at this the other way around...
```python
us_ind = df[(df['first_country'] == 'USA') | (df['first_country'] == 'India')]
data_sub = df.groupby('first_country')['year_added'].value_counts().unstack().fillna(0).loc[['USA', 'India']].cumsum(axis=0).T

# Centre the stacked counts around zero to get the stream-graph shape
data_sub.insert(0, "base", np.zeros(len(data_sub)))
data_sub = data_sub.add(-us_ind['year_added'].value_counts() / 2, axis=0)

fig, ax = plt.subplots(1, 1, figsize=(14, 6))
color = ['#b20710', '#221f1f'][::-1]

hs_list = data_sub.columns
for i, hs in enumerate(hs_list):
    if i == 0:
        continue
    # Fill the band between consecutive cumulative columns
    ax.fill_between(data_sub.index, data_sub.iloc[:, i - 1], data_sub.iloc[:, i], color=color[i - 1])

for s in ['top', 'right', 'bottom', 'left']:
    ax.spines[s].set_visible(False)

ax.set_axisbelow(True)
ax.set_yticks([])
ax.grid(False)

fig.text(0.16, 0.76, 'USA vs. India: Stream graph of new content added', fontsize=15, fontweight='bold', fontfamily='serif')
fig.text(0.16, 0.575,
'''
Seeing the data displayed like this helps us to realise
just how much content is added in the USA.

Remember, India has the second largest amount of
content, yet is dwarfed by the USA.
'''
, fontsize=12, fontweight='light', fontfamily='serif')

fig.text(0.16, 0.41, "India", fontweight="bold", fontfamily='serif', fontsize=15, color='#b20710')
fig.text(0.208, 0.41, "|", fontweight="bold", fontfamily='serif', fontsize=15, color='black')
fig.text(0.218, 0.41, "USA", fontweight="bold", fontfamily='serif', fontsize=15, color='#221f1f')

ax.tick_params(axis='y', which='both', length=0)

plt.show()
```

## 7.0 Word Cloud

We have taken the title column to display the word cloud. Instead of using the default shape, we have built the cloud in the shape of the Netflix logo.
```python
# Custom colour map based on the Netflix palette
cmap = matplotlib.colors.LinearSegmentedColormap.from_list("", ['#221f1f', '#b20710'])

text = str(list(df['title'])).replace(',', '').replace('[', '').replace("'", '').replace(']', '').replace('.', '')
mask = np.array(Image.open('netflix.JPG'))

wordcloud = WordCloud(background_color='white', width=500, height=200, colormap=cmap, max_words=150, mask=mask).generate(text)

plt.figure(figsize=(5, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()
```

## 8.0 ML Classification Model

We have many distinct rating labels but relatively few rows of data, so building a classifier over all of the labels would be very difficult. Hence, we reduce the number of labels to three: the two most common ratings, and everything else grouped as "other".
```python
labels = df['rating'].unique()
labels_count = df['rating'].value_counts()

label_dict = {}
for i in range(len(labels)):
    # `in` on a Series checks the index, i.e. the two most common ratings
    if labels[i] in labels_count[:2]:
        label_dict.update({labels[i]: i})
    else:
        label_dict.update({labels[i]: 10})

label_dict
df['rating_labels'] = df['rating'].map(label_dict)
df
```

We will split the data manually because with train_test_split there is no guarantee that the data will be split evenly across each category.
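As an aside, scikit-learn's `train_test_split` does accept a `stratify` argument that keeps the class proportions roughly equal in both halves; a minimal sketch on toy data (the `X_toy`/`y_toy` names are illustrative, not from this notebook):

```python
from sklearn.model_selection import train_test_split

# Toy data: 8 samples with an imbalanced 6:2 class split
X_toy = [[i] for i in range(8)]
y_toy = [0, 0, 0, 0, 0, 0, 1, 1]

# stratify=y_toy preserves the 3:1 class ratio in both halves
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=0, stratify=y_toy)

print(sorted(y_tr))  # [0, 0, 0, 1]
print(sorted(y_te))  # [0, 0, 0, 1]
```

The manual per-category split below achieves a similar effect while keeping the row order within each category deterministic.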
```python
trains = []
tests = []
for i in df['rating_labels'].unique():
    split_train = round(df[df['rating_labels'] == i].shape[0] * 0.8)
    trains.append(df[df['rating_labels'] == i][:split_train])
    tests.append(df[df['rating_labels'] == i][split_train:])

train = pd.concat(trains, ignore_index=True)
test = pd.concat(tests, ignore_index=True)

X_train = train["description"]
X_test = test["description"]
y_train = train["rating_labels"]
y_test = test["rating_labels"]

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
```

TfidfVectorizer has been used to convert the description column into arrays of numbers which can be used to train the Naive Bayes and Logistic Regression models.
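To make the vectorizer's output concrete, here is a minimal sketch on two toy sentences (the `docs` example is illustrative, not from the dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

vec = TfidfVectorizer()
dtm = vec.fit_transform(docs)  # sparse document-term matrix

print(sorted(vec.vocabulary_))  # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(dtm.shape)                # (2, 7): 2 documents x 7 distinct terms
```

Each row is a document and each column a term, weighted by term frequency times inverse document frequency; this is the representation the models below are trained on.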
```python
# instantiate the vectorizer
vectorizer = TfidfVectorizer()

# fit, then transform the training data
vectorizer.fit(X_train)
X_train_dtm = vectorizer.transform(X_train)

# equivalently, combine fit and transform into a single step;
# this is faster and what most people would do
X_train_dtm = vectorizer.fit_transform(X_train)

# transform the testing data (using the fitted vocabulary) into a document-term matrix
X_test_dtm = vectorizer.transform(X_test)

# instantiate a Multinomial Naive Bayes model
nb = MultinomialNB()

# train on X_train_dtm (timing it with an IPython "magic command")
%time nb.fit(X_train_dtm, y_train)

# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)
y_pred_prob = nb.predict_proba(X_test_dtm)

# calculate AUC (one-vs-rest, since this is a multiclass problem)
metrics.roc_auc_score(y_test, y_pred_prob, multi_class='ovr')
print(classification_report(y_test, y_pred_class))
```

#### Logistic Regression is a lot slower than Naive Bayes.

```python
# instantiate a logistic regression model
logreg = LogisticRegression()

# train the model using X_train_dtm
%time logreg.fit(X_train_dtm, y_train)

# make class predictions for X_test_dtm
y_pred_class = logreg.predict(X_test_dtm)

# calculate predicted probabilities for X_test_dtm (well calibrated)
y_pred_prob = logreg.predict_proba(X_test_dtm)

# calculate AUC (one-vs-rest)
metrics.roc_auc_score(y_test, y_pred_prob, multi_class='ovr')
print(classification_report(y_test, y_pred_class))
```

## 9.0 Interpreting the results

We have tried to show different techniques of visualization that help keep the audience engaged throughout the presentation. Plots help us express our views better and make the findings much easier for people to understand.